Sketch ⋆-metric: Comparing Data Streams via Sketching

Research Report 

Emmanuelle Anceaume Yann Busnel 

IRISA / CNRS, Rennes, France LINA / Université de Nantes, Nantes, France

Emmanuelle.Anceaume@irisa.fr Yann.Busnel@univ-nantes.fr 



Abstract - In this paper, we consider the problem of estimating the distance between any two large data streams under small-space constraints. This problem is of utmost importance in data-intensive monitoring applications where input streams are generated rapidly. These streams need to be processed on the fly and accurately to quickly determine any deviance from nominal behavior. We present a new metric, the Sketch ⋆-metric, which allows us to define a distance between updatable summaries (or sketches) of large data streams. An important feature of the Sketch ⋆-metric is that, given a measure on the entire initial data streams, the Sketch ⋆-metric preserves the axioms of the latter measure on the sketch (such as non-negativity, identity, symmetry, and the triangle inequality, but also specific properties of the f-divergence). Extensive experiments conducted on both synthetic traces and real data allow us to validate the robustness and accuracy of the Sketch ⋆-metric.

Index Terms - Data stream; metric; randomized approximation algorithm.

I. Introduction 

The main objective of this paper is to propose a novel metric that reflects the relationships between any two discrete probability distributions in the context of massive data streams. Specifically, this metric, denoted by Sketch ⋆-metric, allows us to efficiently estimate a broad class of distance measures between any two large data streams by computing these distances using only compact synopses or sketches of the streams. The Sketch ⋆-metric is distribution-free and makes no assumption about the underlying data volume. It is thus capable of comparing any two data streams and identifying their correlation, if any; more generally, it allows us to acquire a deep understanding of the structure of the input streams. The formalization of this metric is the first contribution of this paper.

Estimating distances between data streams is important in data-intensive applications. Many different domains are concerned by such analyses, including machine learning, data mining, databases, information retrieval, and network monitoring. In all these applications, it is necessary to quickly and precisely process a huge amount of data. For instance, in IP network management, the analysis of input streams makes it possible to rapidly detect the presence of outliers or intrusions when changes in the communication patterns occur [1]. In sensor networks, such an analysis enables the correlation of geographical and environmental information [2], [3]. Actually, the problem of detecting changes or outliers in a data stream is similar to the problem of identifying patterns that do not conform to the expected behavior, which has been an active area of research for many decades. For instance, depending on the specificities of the domain and the type of outliers considered, different methods have been designed, namely classification-based, clustering-based, nearest-neighbor-based, statistical, spectral, and information-theoretic ones. To accurately analyze streams of data, a panel of information-theoretic measures and distances has been proposed to answer the specificities of the analyses. Among them, the most commonly used are the Kullback-Leibler (KL) divergence [4], the f-divergence introduced by Csiszár, Morimoto and Ali & Silvey [5], [6], [7], the Jensen-Shannon divergence and the Bhattacharyya distance [8]. More details can be found in the comprehensive survey of Basseville [9].

Unfortunately, computing information-theoretic measures of distance in the data stream model is challenging, essentially because one needs to process a huge amount of data on the fly (i.e., in one pass) while using very little storage with respect to the size of the stream. In addition, the analysis must be robust over time to detect any sudden change in the observed streams (which may be the manifestation of a denial-of-service attack on routers or of worm propagation). We tackle this issue by presenting an approximation algorithm that constructs a sketch of the stream from which the Sketch ⋆-metric is computed. This algorithm is a one-pass algorithm. It uses very basic computations and little storage space (i.e., $O(t(\log n + k \log m))$ bits, where $k$ and $t$ are precision parameters, $m$ is an upper bound on the stream size, and $n$ is the number of distinct items in the stream). It does not need any information on the size of the input streams nor on their structure. This is the second contribution of the paper.

Finally, the robustness of our approach is validated with a 
detailed experimentation study based on synthetic traces that 
range from stable streams to highly skewed ones. 

The paper is organized as follows. First, Section II reviews the related work on classical generalized metrics and their applications to the data stream model, while Section III describes this model. Section IV presents the necessary background that makes the paper self-contained. Section V formalizes the Sketch ⋆-metric. Section VI presents the algorithm that fairly approximates the Sketch ⋆-metric in one pass, and Section VII presents extensive experiments (on both synthetic traces and real data) of our algorithm. Finally, we conclude in Section VIII.

II. Related Work 

Work on data stream analysis mainly focuses on efficient methods (data structures and algorithms) to answer different kinds of queries over massive data streams. Mostly, these methods consist in deriving statistical estimators over the data stream, in creating summary representations of streams (to build histograms, wavelets, and quantiles), and in comparing data streams. Regarding the construction of estimators, a seminal work is due to Alon et al. [10]. The authors have proposed estimators of the frequency moments $F_k$ of a stream, which are important statistical tools that allow one to quantify specificities of a data stream. Subsequently, a lot of attention has been paid to the strongly related notion of the entropy of a stream, and all notions based on entropy (i.e., norm and relative entropy) [11]. These notions are essentially related to the quantification of the amount of randomness of a stream (e.g., [12], [13], [14], [15], [16], [17]). The construction of synopses or sketches of the data stream has been proposed for different applications (e.g., [18], [19], [20]).

Distance and divergence measures are key measures in statistical inference and data processing problems [9]. There exist two widely used broad classes of measures, namely the f-divergences and the Bregman divergences. Among them, there exist two classical distances, namely the Kullback-Leibler (KL) divergence and the Hellinger distance, that are very important to quantify the amount of information that separates two distributions. In [16], the authors have proposed a one-pass algorithm for estimating the KL divergence of an observed stream compared to an expected one. Experimental evaluations have shown that the estimation provided by this algorithm is accurate for different adversarial settings for which the quality of other methods dramatically decreases. However, this solution assumes that the expected stream is the uniform one, that is, a fully random stream. Actually, in [21], the authors propose a characterization of the information divergences that are not sketchable. They have proven that any distance that does not have "norm-like" properties is not sketchable. In the present paper, we go one step further by formalizing a metric that makes it possible to efficiently and accurately estimate a broad class of distance measures between any two large data streams by computing these distances solely on compact synopses or sketches of the streams.

III. Data Stream Model 

We consider a system in which a node $P$ receives a large data stream $\sigma = a_1, a_2, \ldots, a_m$ of data items that arrive sequentially. In the following, we describe a single instance of $P$, but clearly multiple instances of $P$ may co-exist in a system (e.g., in case $P$ represents a router or a base station in a sensor network). Each data item $i$ of the stream $\sigma$ is drawn from the universe $\Omega = \{1, 2, \ldots, n\}$ where $n$ is very large. Data items can be repeated multiple times in the stream. In the following we suppose that the length $m$ of the stream is not known. Items in the stream arrive regularly and quickly, and due to memory constraints, need to be processed sequentially and in an online manner. Therefore, node $P$ can locally store only a small fraction of the items and perform simple operations on them. The algorithms we consider in this work are characterized by the fact that they can approximate some function on $\sigma$ with a very limited amount of memory. We refer the reader to [22] for a detailed description of data streaming models and algorithms.

IV. Information Divergence of Data Streams 

We first present notations and background that make this 
paper self-contained. 

A. Preliminaries 

A natural approach to study a data stream $\sigma$ is to model it as an empirical data distribution over the universe $\Omega$, given by $(p_1, p_2, \ldots, p_n)$ with $p_i = x_i/m$, and $x_i = |\{j : a_j = i\}|$ representing the number of times data item $i$ appears in $\sigma$. We have $m = \sum_{i \in \Omega} x_i$.

1) Entropy: Intuitively, the entropy is a measure of the randomness of a data stream $\sigma$. The entropy $H(\sigma)$ is minimum (i.e., equal to zero) when all the items in the stream are the same, and it reaches its maximum (i.e., $\log_2 m$) when all the items in the stream are distinct. Specifically, we have $H(\sigma) = -\sum_{i \in \Omega} p_i \log_2 p_i$. The log is to the base 2 and thus entropy is expressed in bits. By convention, we have $0 \log 0 = 0$. Note that the number of times $x_i$ item $i$ appears in a stream is commonly called the frequency of item $i$. The norm of the entropy is defined as $F_H = \sum_{i \in \Omega} x_i \log x_i$.
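As an illustration of these two quantities, the following minimal Python sketch computes the empirical entropy and the entropy norm of a small in-memory stream. It stores every frequency exactly, so it is only a reference computation and not a streaming algorithm. The function names are hypothetical, and the use of base 2 for the entropy norm is an assumption.

```python
import math
from collections import Counter

def empirical_entropy(stream):
    """Entropy (in bits) of the empirical distribution p_i = x_i / m."""
    counts = Counter(stream)      # x_i: frequency of each item i
    m = sum(counts.values())      # m: length of the stream
    return -sum((x / m) * math.log2(x / m) for x in counts.values())

def entropy_norm(stream):
    """Entropy norm F_H = sum_i x_i * log2(x_i) (base 2 is an assumption)."""
    return sum(x * math.log2(x) for x in Counter(stream).values())
```

A stream of identical items has entropy zero, while a stream of $m$ distinct items has entropy $\log_2 m$, as stated above.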

2) 2-universal Hash Functions: In the following, we intensively use hash functions randomly picked from a 2-universal hash family. A collection $H$ of hash functions $h : \{1, \ldots, M\} \to \{0, \ldots, M'\}$ is said to be 2-universal if, for $h$ chosen uniformly at random from $H$ and for every two different items $i, j \in [M]$, $\mathbb{P}\{h(i) = h(j)\} \leq \frac{1}{M'}$, which matches the probability of collision obtained if the hash function assigned truly random values to any $i \in [M]$. In the following, notation $[M]$ means $\{1, \ldots, M\}$.
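For concreteness, one classical way to obtain such a family is the Carter-Wegman construction $h(x) = ((ax + b) \bmod P) \bmod M'$ with $P$ prime. The sketch below is an assumption about how the hash functions could be drawn, not necessarily the construction used by the authors; the names `make_hash` and `P` are hypothetical, and the output range here is $\{0, \ldots, M'-1\}$.

```python
import random

# The Mersenne prime 2**31 - 1; it must exceed every item identifier.
P = 2_147_483_647

def make_hash(num_cells, rng=random):
    """Draw h(x) = ((a*x + b) mod P) mod num_cells from the Carter-Wegman family."""
    a = rng.randrange(1, P)   # a must be nonzero
    b = rng.randrange(0, P)
    return lambda x: ((a * x + b) % P) % num_cells
```

Each call to `make_hash` picks a fresh pair $(a, b)$, that is, a fresh member of the family; two distinct items then collide with probability at most `1 / num_cells`.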

B. Metrics and divergences 

1) Metric definitions: The classical definition of a metric 
is based on a set of four axioms. 

Definition 1 (Metric) Given a set $X$, a metric is a function $d : X \times X \to \mathbb{R}$ such that, for any $x, y, z \in X$, we have:

Non-negativity: $d(x, y) \geq 0$ (1)

Identity of indiscernibles: $d(x, y) = 0 \Leftrightarrow x = y$ (2)

Symmetry: $d(x, y) = d(y, x)$ (3)

Triangle inequality: $d(x, y) \leq d(x, z) + d(z, y)$ (4)

In the context of information divergence, usual distance functions are not precisely metrics. Indeed, most divergence functions do not verify all four axioms, but only a subset of them. We recall hereafter some definitions of generalized metrics.

Definition 2 (Pseudometric) Given a set $X$, a pseudometric is a function that verifies the axioms of a metric with the exception of the identity of indiscernibles, which is replaced by

$$\forall x \in X, \ d(x, x) = 0.$$

Note that this definition allows that $d(x, y) = 0$ for some $x \neq y$ in $X$.

Definition 3 (Quasimetric) Given a set $X$, a quasimetric is a function that verifies all the axioms of a metric with the exception of the symmetry (cf. Relation (3)).

Definition 4 (Semimetric) Given a set $X$, a semimetric is a function that verifies all the axioms of a metric with the exception of the triangle inequality (cf. Relation (4)).

Definition 5 (Premetric) Given a set $X$, a premetric is a pseudometric that relaxes both the symmetry and triangle inequality axioms.

Definition 6 (Pseudoquasimetric) Given a set $X$, a pseudoquasimetric is a function that relaxes both the identity of indiscernibles and the symmetry axioms.

Note that the latter definition simply corresponds to a 
premetric satisfying the triangle inequality. Remark also that 
all the generalized metrics preserve the non-negativity axiom. 

2) Divergences: We now give the definition of two broad 
classes of generalized metrics, usually denoted as divergences. 

a) f-divergence: Mostly used in the context of statistics and probability theory, an f-divergence $\mathcal{D}_f$ is a premetric that guarantees monotonicity and convexity.

Definition 7 (f-divergence) Let $p$ and $q$ be two $\Omega$-point distributions. Given a convex function $f : (0, \infty) \to \mathbb{R}$ such that $f(1) = 0$, the f-divergence of $q$ from $p$ is:

$$\mathcal{D}_f(p\|q) = \sum_{i \in \Omega} q_i f\left(\frac{p_i}{q_i}\right),$$

where by convention $0 f(\frac{0}{0}) = 0$, $a f(\frac{0}{a}) = a \lim_{u \to 0} f(u)$, and $0 f(\frac{a}{0}) = a \lim_{u \to \infty} f(u)/u$ if these limits exist.

Following this definition, any f-divergence verifies both monotonicity and convexity.
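To make the definition concrete, here is a minimal Python sketch of a generic f-divergence. It assumes that all $q_i$ are strictly positive, so the limiting conventions above are not needed; the function names are hypothetical.

```python
import math

def f_divergence(p, q, f):
    """D_f(p||q) = sum_i q_i * f(p_i / q_i), assuming every q_i > 0."""
    return sum(qi * f(pi / qi) for pi, qi in zip(p, q))

def f_kl(t):
    """Generator f(t) = t log t, with the convention 0 log 0 = 0."""
    return t * math.log(t) if t > 0 else 0.0

def f_tv(t):
    """Generator f(t) = |t - 1| / 2, which yields the total variation distance."""
    return abs(t - 1) / 2
```

With `f_kl` the function returns the Kullback-Leibler divergence (in nats), and with `f_tv` it returns half of the $L_1$ distance between $p$ and $q$; both generators are convex and vanish at 1.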

Property 8 (Monotonicity) Given $\kappa$ an arbitrary transition probability that respectively transforms two $\Omega$-point distributions $p$ and $q$ into $p_\kappa$ and $q_\kappa$, we have:

$$\mathcal{D}_f(p\|q) \geq \mathcal{D}_f(p_\kappa\|q_\kappa).$$

Property 9 (Convexity) Let $p_1$, $p_2$, $q_1$ and $q_2$ be four $\Omega$-point distributions. Given any $\lambda \in [0, 1]$, we have:

$$\mathcal{D}_f(\lambda p_1 + (1-\lambda) p_2 \,\|\, \lambda q_1 + (1-\lambda) q_2) \leq \lambda \mathcal{D}_f(p_1\|q_1) + (1-\lambda) \mathcal{D}_f(p_2\|q_2).$$

This class of divergences has been introduced in independent works by, chronologically, Csiszár, Morimoto and Ali & Silvey [5], [6], [7]. All the distance measures in the so-called Ali-Silvey distances are applicable to quantifying statistical differences between data streams.



b) Bregman divergence: Initially proposed in [23], this class of generalized metrics encloses quasimetrics and semimetrics, as these divergences satisfy neither the triangle inequality nor symmetry.

Definition 10 (Bregman divergence) Given $F$ a continuously differentiable and strictly convex function defined on a closed convex set $C$, the Bregman divergence associated with $F$ for $p, q \in C$ is defined as

$$\mathcal{B}_F(p\|q) = F(p) - F(q) - \langle \nabla F(q), (p - q) \rangle,$$

where the operator $\langle \cdot, \cdot \rangle$ denotes the inner product.

In the context of data streams, it is possible to reformulate this definition according to probability theory. Specifically,

Definition 11 (Decomposable Bregman divergence) Let $p$ and $q$ be two $\Omega$-point distributions. Given a strictly convex function $F : (0, 1] \to \mathbb{R}$, the Bregman divergence associated with $F$ of $q$ from $p$ is defined as

$$\mathcal{B}_F(p\|q) = \sum_{i \in \Omega} \left( F(p_i) - F(q_i) - (p_i - q_i) F'(q_i) \right).$$
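The decomposable form translates directly into code. The following Python sketch is illustrative only: it assumes strictly positive entries and takes the derivative $F'$ as an explicit argument; all names are hypothetical.

```python
import math

def bregman(p, q, F, dF):
    """B_F(p||q) = sum_i F(p_i) - F(q_i) - (p_i - q_i) * F'(q_i)."""
    return sum(F(pi) - F(qi) - (pi - qi) * dF(qi) for pi, qi in zip(p, q))

# F(t) = t log t yields the KL divergence when p and q both sum to one.
def F_kl(t):
    return t * math.log(t)

def dF_kl(t):
    return math.log(t) + 1.0

# F(t) = t^2 yields the squared Euclidean distance.
def F_sq(t):
    return t * t

def dF_sq(t):
    return 2.0 * t
```

For instance, `bregman(p, q, F_sq, dF_sq)` equals $\sum_i (p_i - q_i)^2$, a Bregman divergence that happens to be symmetric, whereas the one generated by `F_kl` is not.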

Following these definitions, any Bregman divergence ver- 
ifies non-negativity and convexity in its first argument, but 
not necessarily in the second argument. Another interesting 
property is given by thinking of the Bregman divergences as 
an operator of the function F. 

Property 12 (Linearity) Let $F_1$ and $F_2$ be two strictly convex and differentiable functions. Given any $\lambda \in [0, 1]$, we have that

$$\mathcal{B}_{F_1 + \lambda F_2}(p\|q) = \mathcal{B}_{F_1}(p\|q) + \lambda \mathcal{B}_{F_2}(p\|q).$$

3) Classical metrics: In this section, we present several commonly used metrics in the $\Omega$-point distribution context. These specific metrics are used in the evaluation part presented in Section VII.

a) Kullback-Leibler divergence: The Kullback-Leibler (KL) divergence [4], also called the relative entropy, is a robust metric for measuring the statistical difference between two data streams. The KL divergence has the special feature that it is both an f-divergence and a Bregman divergence (with $f(t) = F(t) = t \log t$).

Given $p$ and $q$ two $\Omega$-point distributions, the Kullback-Leibler divergence is then defined as

$$\mathcal{D}_{KL}(p\|q) = \sum_{i \in \Omega} p_i \log \frac{p_i}{q_i} = H(p, q) - H(p), \quad (5)$$

where $H(p) = -\sum_{i \in \Omega} p_i \log p_i$ is the (empirical) entropy of $p$ and $H(p, q) = -\sum_{i \in \Omega} p_i \log q_i$ is the cross entropy of $p$ and $q$.

b) Jensen-Shannon divergence: The Jensen-Shannon divergence (JS) is a symmetrized and smoothed version of the Kullback-Leibler divergence. Also known as information radius (IRad) or total divergence to the average, it is defined as

$$\mathcal{D}_{JS}(p\|q) = \frac{1}{2} \mathcal{D}_{KL}(p\|\ell) + \frac{1}{2} \mathcal{D}_{KL}(q\|\ell), \quad (6)$$

where $\ell = \frac{1}{2}(p + q)$. Note that the square root of this divergence is a metric.

c) Bhattacharyya distance: The Bhattacharyya distance is derived from his proposed measure of similarity between two multinomial distributions, known as the Bhattacharyya coefficient (BC) [8]. It is defined as

$$\mathcal{D}_B(p\|q) = -\log(BC(p, q)) \quad \text{where} \quad BC(p, q) = \sum_{i \in \Omega} \sqrt{p_i q_i}.$$

This distance is a semimetric as it does not verify the triangle inequality. Note that the famous Hellinger distance [24], equal to $\sqrt{1 - BC(p, q)}$, verifies it.
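The classical measures above can be computed on explicit distributions as in the following Python sketch. It is an illustration under stated assumptions: natural logarithms are used (the base is not fixed by the definitions), terms with $p_i = 0$ are skipped according to the convention $0 \log 0 = 0$, and `kl` fails if some $q_i = 0$ while $p_i > 0$. The function names are hypothetical.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence, skipping terms with p_i = 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence via the midpoint distribution l = (p + q) / 2."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mid) + 0.5 * kl(q, mid)

def bc(p, q):
    """Bhattacharyya coefficient."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

def bhattacharyya(p, q):
    """Bhattacharyya distance; infinite in the limit when supports are disjoint."""
    return -math.log(bc(p, q))

def hellinger(p, q):
    """Hellinger distance; max() guards against tiny negative rounding errors."""
    return math.sqrt(max(0.0, 1.0 - bc(p, q)))
```

Unlike `kl`, the function `js` is always finite because the midpoint distribution is positive wherever $p$ or $q$ is; it is bounded by $\log 2$, which is reached for distributions with disjoint supports.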

V. Sketch ⋆-metric

We now present a method to sketch two input data streams $\sigma_1$ and $\sigma_2$, and to compute any generalized metric $\phi$ between these sketches such that this computation preserves all the properties of $\phi$ computed on $\sigma_1$ and $\sigma_2$. The proof of correctness of this method is presented in this section.

Definition 13 (Sketch ⋆-metric) Let $p$ and $q$ be any two $\Omega$-point distributions. Given a precision parameter $k$, and any generalized metric $\phi$ on the set of all $\Omega$-point distributions, there exists a Sketch ⋆-metric $\widehat{\phi}_k$ defined as follows

$$\widehat{\phi}_k(p\|q) = \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_\rho \| \widehat{q}_\rho) \quad \text{with} \quad \forall a \in \rho, \ \widehat{p}_\rho(a) = \sum_{i \in a} p(i),$$

where $\mathcal{P}_k(\Omega)$ is the set of all partitions of an $\Omega$-element set into exactly $k$ nonempty and mutually exclusive cells.

Remark 14 Note that for $k > |\Omega|$, there does not exist a partition of $\Omega$ into $k$ nonempty parts. By convention, we consider that $\widehat{\phi}_k(p\|q) = \phi(p\|q)$ in this specific context.
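For a very small universe, the definition can be evaluated exactly by enumerating every $k$-cell partition. The Python sketch below does so with restricted growth strings; it is meant only as an illustration of the definition (its cost grows with the Stirling numbers of the second kind), and the function names are hypothetical.

```python
def k_partitions(n, k):
    """Yield every partition of range(n) into exactly k nonempty cells,
    encoded as a list `cell` where cell[i] is the cell index of item i."""
    def rec(i, cell, used):
        if i == n:
            if used == k:
                yield list(cell)
            return
        # Item i joins an existing cell or opens the next new one.
        for c in range(min(used + 1, k)):
            cell.append(c)
            yield from rec(i + 1, cell, max(used, c + 1))
            cell.pop()
    yield from rec(0, [], 0)

def project(p, cell, k):
    """Aggregate p over the cells: p_hat(a) = sum of p(i) for i in cell a."""
    out = [0.0] * k
    for i, pi in enumerate(p):
        out[cell[i]] += pi
    return out

def sketch_metric_exact(p, q, k, phi):
    """Maximum of phi over all k-cell partitions of the universe."""
    if k > len(p):  # convention of Remark 14
        return phi(p, q)
    return max(phi(project(p, c, k), project(q, c, k))
               for c in k_partitions(len(p), k))
```

Here `phi` is any function taking two lists of probabilities, for example a total variation distance or one of the divergences of Section IV.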

In this section, we focus on the preservation of axioms and properties of a generalized metric $\phi$ by the corresponding Sketch ⋆-metric $\widehat{\phi}_k$.

A. Axioms preserving 

Theorem 15 Given any generalized metric $\phi$ then, for any $k \in \Omega$, the corresponding Sketch ⋆-metric $\widehat{\phi}_k$ preserves all the axioms of $\phi$.

Proof: The proof is directly derived from Lemmata 16, 17, 18 and 19. ■

Lemma 16 (Non-negativity) Given any generalized metric $\phi$ verifying the Non-negativity axiom then, for any $k \in \Omega$, the corresponding Sketch ⋆-metric $\widehat{\phi}_k$ preserves the Non-negativity axiom.

Proof: Let $p$ and $q$ be any two $\Omega$-point distributions. By definition,

$$\widehat{\phi}_k(p\|q) = \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_\rho \| \widehat{q}_\rho).$$

As $\phi$ is non-negative for any two $k$-point distributions, we have $\widehat{\phi}_k(p\|q) \geq 0$, which concludes the proof. ■

Lemma 17 (Identity of indiscernibles) Given any generalized metric $\phi$ verifying the Identity of indiscernibles axiom then, for any $k \in \Omega$, the corresponding Sketch ⋆-metric $\widehat{\phi}_k$ preserves the Identity of indiscernibles axiom.

Proof: Let $p$ be any $\Omega$-point distribution. We have

$$\widehat{\phi}_k(p\|p) = \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_\rho \| \widehat{p}_\rho) = 0,$$

due to the Identity of indiscernibles axiom of $\phi$.

Consider now two $\Omega$-point distributions $p$ and $q$ such that $\widehat{\phi}_k(p\|q) = 0$. Metric $\phi$ verifies both the non-negativity axiom (by construction) and the Identity of indiscernibles axiom (by assumption). Thus we have $\forall \rho \in \mathcal{P}_k(\Omega), \ \widehat{p}_\rho = \widehat{q}_\rho$, leading to

$$\forall \rho \in \mathcal{P}_k(\Omega), \ \forall a \in \rho, \ \sum_{i \in a} p(i) = \sum_{i \in a} q(i). \quad (7)$$

Moreover, for any $i \in \Omega$, there exists a partition $\rho \in \mathcal{P}_k(\Omega)$ such that $\{i\} \in \rho$ (this requires $k \geq 2$ whenever $n \geq 2$). By Equation (7), $\forall i \in \Omega, \ p(i) = q(i)$, and so $p = q$.

Combining the two parts of the proof leads to $\widehat{\phi}_k(p\|q) = 0 \Leftrightarrow p = q$, which concludes the proof of the Lemma. ■

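Lemma 18 (Symmetry) Given any generalized metric $\phi$ verifying the Symmetry axiom then, for any $k \in \Omega$, the corresponding Sketch ⋆-metric $\widehat{\phi}_k$ preserves the Symmetry axiom.

Proof: Let $p$ and $q$ be any two $\Omega$-point distributions. We have

$$\widehat{\phi}_k(p\|q) = \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_\rho \| \widehat{q}_\rho).$$

Let $\overline{\rho} \in \mathcal{P}_k(\Omega)$ be a $k$-cell partition such that $\phi(\widehat{p}_{\overline{\rho}} \| \widehat{q}_{\overline{\rho}}) = \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_\rho \| \widehat{q}_\rho)$. We get

$$\widehat{\phi}_k(p\|q) = \phi(\widehat{p}_{\overline{\rho}} \| \widehat{q}_{\overline{\rho}}) = \phi(\widehat{q}_{\overline{\rho}} \| \widehat{p}_{\overline{\rho}}) \leq \widehat{\phi}_k(q\|p).$$

By symmetry, considering $\underline{\rho} \in \mathcal{P}_k(\Omega)$ such that $\phi(\widehat{q}_{\underline{\rho}} \| \widehat{p}_{\underline{\rho}}) = \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{q}_\rho \| \widehat{p}_\rho)$, we also have $\widehat{\phi}_k(q\|p) \leq \widehat{\phi}_k(p\|q)$, which concludes the proof. ■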

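Lemma 19 (Triangle inequality) Given any generalized metric $\phi$ verifying the Triangle inequality axiom then, for any $k \in \Omega$, the corresponding Sketch ⋆-metric $\widehat{\phi}_k$ preserves the Triangle inequality axiom.

Proof: Let $p$, $q$ and $r$ be any three $\Omega$-point distributions. Let $\overline{\rho} \in \mathcal{P}_k(\Omega)$ be a $k$-cell partition such that $\phi(\widehat{p}_{\overline{\rho}} \| \widehat{q}_{\overline{\rho}}) = \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_\rho \| \widehat{q}_\rho)$. We have

$$\widehat{\phi}_k(p\|q) = \phi(\widehat{p}_{\overline{\rho}} \| \widehat{q}_{\overline{\rho}})$$
$$\leq \phi(\widehat{p}_{\overline{\rho}} \| \widehat{r}_{\overline{\rho}}) + \phi(\widehat{r}_{\overline{\rho}} \| \widehat{q}_{\overline{\rho}})$$
$$\leq \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_\rho \| \widehat{r}_\rho) + \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{r}_\rho \| \widehat{q}_\rho)$$
$$= \widehat{\phi}_k(p\|r) + \widehat{\phi}_k(r\|q),$$

which concludes the proof. ■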
B. Properties preserving 

Theorem 20 Given an f-divergence $\phi$ then, for any $k \in \Omega$, the corresponding Sketch ⋆-metric $\widehat{\phi}_k$ is also an f-divergence.

Proof: From Theorem 15, $\widehat{\phi}_k$ preserves the axioms of the generalized metric. Thus, $\widehat{\phi}_k$ and $\phi$ are in the same equivalence class. Moreover, from Lemma 22, $\widehat{\phi}_k$ verifies the monotonicity property. Thus, as the f-divergence is the only class of decomposable information monotonic divergences (cf. [25]), $\widehat{\phi}_k$ is also an f-divergence. ■

Theorem 21 Given a Bregman divergence $\phi$ then, for any $k \in \Omega$, the corresponding Sketch ⋆-metric $\widehat{\phi}_k$ is also a Bregman divergence.

Proof: From Theorem 15, $\widehat{\phi}_k$ preserves the axioms of the generalized metric. Thus, $\widehat{\phi}_k$ and $\phi$ are in the same equivalence class. Moreover, the Bregman divergence is characterized by the property of transitivity (cf. [26]) defined as follows. Given $p$, $q$ and $r$ three $\Omega$-point distributions such that $q = \Pi(L|r)$ and $p \in L$, where $\Pi$ is a selection rule according to the definition of Csiszár in [26] and $L$ is a subset of the $\Omega$-point distributions, we have the Generalized Pythagorean Theorem:

$$\phi(p\|q) + \phi(q\|r) = \phi(p\|r).$$

Moreover, the authors in [27] show that the set $S_n$ of all discrete probability distributions over $n$ elements ($X = \{x_1, \ldots, x_n\}$) is a Riemannian manifold, and that it owns another, different, dually flat affine structure. They also show that these dual structures give rise to the generalized Pythagorean theorem. This is verified for the coordinates in $S_n$ and for the dual coordinates [27]. Combining these results with the projection theorem [26], [27], we obtain that

$$\widehat{\phi}_k(p\|r) = \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_\rho \| \widehat{r}_\rho)$$
$$= \max_{\rho \in \mathcal{P}_k(\Omega)} \left( \phi(\widehat{p}_\rho \| \widehat{q}_\rho) + \phi(\widehat{q}_\rho \| \widehat{r}_\rho) \right)$$
$$= \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_\rho \| \widehat{q}_\rho) + \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{q}_\rho \| \widehat{r}_\rho)$$
$$= \widehat{\phi}_k(p\|q) + \widehat{\phi}_k(q\|r).$$

Finally, by the characterization of Bregman divergence through transitivity [26], and reinforced with the statement of Lemma 24, $\widehat{\phi}_k$ is also a Bregman divergence. ■

In the following, we show that the Sketch ⋆-metric preserves the properties of divergences.

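Lemma 22 (Monotonicity) Given any generalized metric $\phi$ verifying the Monotonicity property then, for any $k \in \Omega$, the corresponding Sketch ⋆-metric $\widehat{\phi}_k$ preserves the Monotonicity property.

Proof: Let $p$ and $q$ be any two $\Omega$-point distributions. Given $c < N$, consider a partition $\mu \in \mathcal{P}_c(\Omega)$. As $\phi$ is monotonic, we have $\phi(p\|q) \geq \phi(\widehat{p}_\mu \| \widehat{q}_\mu)$ [28]. We split the proof into two cases:

Case (1). Suppose that $c \geq k$. Computing $\widehat{\phi}_k(\widehat{p}_\mu \| \widehat{q}_\mu)$ amounts to considering only the $k$-cell partitions $\rho \in \mathcal{P}_k(\Omega)$ that verify

$$\forall b \in \mu, \ \exists a \in \rho, \ b \subseteq a.$$

These partitions form a subset of $\mathcal{P}_k(\Omega)$. The maximal value of $\phi(\widehat{p}_\rho \| \widehat{q}_\rho)$ over this subset cannot be greater than the maximal value over the whole $\mathcal{P}_k(\Omega)$. Thus we have

$$\widehat{\phi}_k(p\|q) = \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_\rho \| \widehat{q}_\rho) \geq \widehat{\phi}_k(\widehat{p}_\mu \| \widehat{q}_\mu).$$

Case (2). Suppose now that $c < k$. By definition, we have $\widehat{\phi}_k(\widehat{p}_\mu \| \widehat{q}_\mu) = \phi(\widehat{p}_\mu \| \widehat{q}_\mu)$. Consider $\rho' \in \mathcal{P}_k(\Omega)$ that refines $\mu$, i.e., such that $\forall a \in \rho', \ \exists b \in \mu, \ a \subseteq b$. There then exists a transition probability that respectively transforms $\widehat{p}_{\rho'}$ and $\widehat{q}_{\rho'}$ into $\widehat{p}_\mu$ and $\widehat{q}_\mu$. As $\phi$ is monotonic, we have

$$\widehat{\phi}_k(p\|q) = \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_\rho \| \widehat{q}_\rho)$$
$$\geq \phi(\widehat{p}_{\rho'} \| \widehat{q}_{\rho'})$$
$$\geq \phi(\widehat{p}_\mu \| \widehat{q}_\mu) = \widehat{\phi}_k(\widehat{p}_\mu \| \widehat{q}_\mu).$$

Finally, for any value of $c$, $\widehat{\phi}_k$ guarantees the monotonicity property. This concludes the proof. ■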

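Lemma 23 (Convexity) Given any generalized metric $\phi$ verifying the Convexity property then, for any $k \in \Omega$, the corresponding Sketch ⋆-metric $\widehat{\phi}_k$ preserves the Convexity property.

Proof: Let $p_1$, $p_2$, $q_1$ and $q_2$ be any four $\Omega$-point distributions. Given any $\lambda \in [0, 1]$, we have:

$$\widehat{\phi}_k(\lambda p_1 + (1-\lambda) p_2 \,\|\, \lambda q_1 + (1-\lambda) q_2) = \max_{\rho \in \mathcal{P}_k(\Omega)} \phi\left(\lambda \widehat{p}_{1,\rho} + (1-\lambda) \widehat{p}_{2,\rho} \,\|\, \lambda \widehat{q}_{1,\rho} + (1-\lambda) \widehat{q}_{2,\rho}\right).$$

Let $\rho' \in \mathcal{P}_k(\Omega)$ be such that

$$\phi\left(\lambda \widehat{p}_{1,\rho'} + (1-\lambda) \widehat{p}_{2,\rho'} \,\|\, \lambda \widehat{q}_{1,\rho'} + (1-\lambda) \widehat{q}_{2,\rho'}\right) = \max_{\rho \in \mathcal{P}_k(\Omega)} \phi\left(\lambda \widehat{p}_{1,\rho} + (1-\lambda) \widehat{p}_{2,\rho} \,\|\, \lambda \widehat{q}_{1,\rho} + (1-\lambda) \widehat{q}_{2,\rho}\right).$$

As $\phi$ verifies the Convexity property, we have:

$$\widehat{\phi}_k(\lambda p_1 + (1-\lambda) p_2 \,\|\, \lambda q_1 + (1-\lambda) q_2)$$
$$= \phi\left(\lambda \widehat{p}_{1,\rho'} + (1-\lambda) \widehat{p}_{2,\rho'} \,\|\, \lambda \widehat{q}_{1,\rho'} + (1-\lambda) \widehat{q}_{2,\rho'}\right)$$
$$\leq \lambda \phi(\widehat{p}_{1,\rho'} \| \widehat{q}_{1,\rho'}) + (1-\lambda) \phi(\widehat{p}_{2,\rho'} \| \widehat{q}_{2,\rho'})$$
$$\leq \lambda \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_{1,\rho} \| \widehat{q}_{1,\rho}) + (1-\lambda) \max_{\rho \in \mathcal{P}_k(\Omega)} \phi(\widehat{p}_{2,\rho} \| \widehat{q}_{2,\rho})$$
$$= \lambda \widehat{\phi}_k(p_1\|q_1) + (1-\lambda) \widehat{\phi}_k(p_2\|q_2),$$

which concludes the proof. ■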

Lemma 24 (Linearity) The Sketch ⋆-metric definition preserves the Linearity property.

Proof: Let $F_1$ and $F_2$ be two strictly convex and differentiable functions, and any $\lambda \in [0, 1]$. Consider the three Bregman divergences generated respectively from $F_1$, $F_2$ and $F_1 + \lambda F_2$.

Let $p$ and $q$ be two $\Omega$-point distributions. We have:

$$\widehat{\mathcal{B}_{F_1 + \lambda F_2}}_k(p\|q) = \max_{\rho \in \mathcal{P}_k(\Omega)} \mathcal{B}_{F_1 + \lambda F_2}(\widehat{p}_\rho \| \widehat{q}_\rho)$$
$$= \max_{\rho \in \mathcal{P}_k(\Omega)} \left( \mathcal{B}_{F_1}(\widehat{p}_\rho \| \widehat{q}_\rho) + \lambda \mathcal{B}_{F_2}(\widehat{p}_\rho \| \widehat{q}_\rho) \right).$$

As $F_1$ and $F_2$ are two strictly convex functions, and taking a leaf out of Jensen's inequality, we have:

$$\widehat{\mathcal{B}_{F_1}}_k(p\|q) + \lambda \widehat{\mathcal{B}_{F_2}}_k(p\|q) \leq \max_{\rho \in \mathcal{P}_k(\Omega)} \left( \mathcal{B}_{F_1}(\widehat{p}_\rho \| \widehat{q}_\rho) + \lambda \mathcal{B}_{F_2}(\widehat{p}_\rho \| \widehat{q}_\rho) \right) = \widehat{\mathcal{B}_{F_1 + \lambda F_2}}_k(p\|q),$$

which concludes the proof. ■

This concludes the proof that the Sketch ⋆-metric preserves all the axioms of a metric as well as the properties of f-divergences and Bregman divergences. We now show how to efficiently implement such a metric.

VI. Approximation algorithm 

In this section, we propose an algorithm that computes the Sketch ⋆-metric in one pass on the stream. By definition of the metric (cf. Definition 13), we need to generate all the possible $k$-cell partitions. The number of these partitions follows the Stirling numbers of the second kind, which is equal to $S(n, k) = \frac{1}{k!} \sum_{j=0}^{k} (-1)^{k-j} \binom{k}{j} j^n$, where $n$ is the size of the items universe. Therefore, $S(n, k)$ grows exponentially with $n$. Unfortunately, $n$ is very large. As the generating function of $S(n, k)$ is equivalent to $x^n$, enumerating all these partitions is unreasonable in terms of space complexity. We show in the following that generating $t = \lceil \log(1/\delta) \rceil$ random $k$-cell partitions, where $\delta$ is the probability of error of our randomized algorithm, is sufficient to guarantee good overall performance of our metric.
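To give a feeling for how quickly the number of $k$-cell partitions grows, the explicit formula for the Stirling numbers of the second kind can be evaluated with exact integer arithmetic. The short Python sketch below is purely illustrative; the function name is hypothetical.

```python
from math import comb, factorial

def stirling2(n, k):
    """S(n, k) = (1/k!) * sum_{j=0..k} (-1)^(k-j) * C(k, j) * j^n."""
    total = sum((-1) ** (k - j) * comb(k, j) * j ** n for j in range(k + 1))
    return total // factorial(k)  # the sum is always divisible by k!
```

For example, `stirling2(10, 3)` is 9330, and the value keeps growing exponentially in $n$ for fixed $k$, which is why the algorithm samples only $t$ partitions instead of enumerating all of them.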

Our algorithm is inspired by the Count-Min Sketch algorithm proposed in [19] by Cormode and Muthukrishnan. Specifically, the Count-Min algorithm is an $(\varepsilon, \delta)$-approximation algorithm that solves the frequency-estimation problem. For any item $v$ in the input stream $\sigma$, the algorithm outputs an estimation $\hat{f}_v$ of the frequency $f_v$ of item $v$ such that $\mathbb{P}\{|\hat{f}_v - f_v| \geq \varepsilon f_v\} \leq \delta$, where $\varepsilon, \delta > 0$ are given as parameters of the algorithm. The estimation is computed by maintaining a two-dimensional array $C$ of $t \times k$ counters, and by using $t$ 2-universal hash functions $h_i$ ($1 \leq i \leq t$), where $k = 2/\varepsilon$ and $t = \lceil \log(1/\delta) \rceil$. Each time an item $v$ is read from the input stream, one counter of each line is incremented, i.e., $C[i][h_i(v)]$ is incremented by one for each $i \in \{1, \ldots, t\}$.
To compute the Sketch ⋆-metric of two streams $\sigma_1$ and $\sigma_2$, two sketches $\widehat{\sigma}_1$ and $\widehat{\sigma}_2$ of these streams are constructed according to the above description. Note that there is no particular assumption on the length of both streams $\sigma_1$ and $\sigma_2$; that is, their respective lengths are finite but unknown. By construction of the 2-universal hash functions $h_i$ ($1 \leq i \leq t$), each line of $C_{\sigma_1}$ and $C_{\sigma_2}$ corresponds to one partition $\rho_i$ of the $N$-point empirical distributions of both $\sigma_1$ and $\sigma_2$. Thus, when a query is issued to compute the given distance $\phi$ between these two streams, the maximal value over all the $t$ partitions $\rho_i$ of the distance $\phi$ between $\widehat{\sigma}_1$ and $\widehat{\sigma}_2$ is returned, i.e., the distance $\phi$ applied to the $i$-th lines of $C_{\sigma_1}$ and $C_{\sigma_2}$ for $1 \leq i \leq t$. Figure 1 presents the pseudo-code of our algorithm.

Lemma 25 Given parameters $k$ and $t$, Algorithm 1 gives an approximation of the Sketch ⋆-metric, using $O(t(\log n + k \log m))$ bits of space.

Proof: The matrices $C_{\sigma_i}$, for any $i \in \{1, 2\}$, are composed of $t \times k$ counters, each of which uses $O(\log m)$ bits. On the other hand, with a suitable choice of hash family, we can store the hash functions above in $O(t \log n)$ space. ■



Figure 1. Sketch ⋆-metric algorithm

Input: Two input streams $\sigma_1$ and $\sigma_2$; the distance $\phi$, $k$ and $t$ settings;
Output: The distance $\widehat{\phi}$ between $\sigma_1$ and $\sigma_2$

Choose $t$ functions $h : [n] \to [k]$, each from a 2-universal hash function family;
$C_{\sigma_1}[1 \ldots t][1 \ldots k] \leftarrow 0$;
$C_{\sigma_2}[1 \ldots t][1 \ldots k] \leftarrow 0$;
for $a_j \in \sigma_1$ do
    $v = a_j$;
    for $i = 1$ to $t$ do
        $C_{\sigma_1}[i][h_i(v)] \leftarrow C_{\sigma_1}[i][h_i(v)] + 1$;
for $a_j \in \sigma_2$ do
    $w = a_j$;
    for $i = 1$ to $t$ do
        $C_{\sigma_2}[i][h_i(w)] \leftarrow C_{\sigma_2}[i][h_i(w)] + 1$;
On query $\widehat{\phi}_k(\sigma_1\|\sigma_2)$ return
    $\widehat{\phi} = \max_{1 \leq i \leq t} \phi(C_{\sigma_1}[i][-], C_{\sigma_2}[i][-])$;
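The pseudo-code of Figure 1 can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not taken from the paper: items are non-negative integers smaller than the prime `P`, the hash functions come from the Carter-Wegman family, both sketches are built with the same seed so that they share their hash functions, and each row of counters is normalized into a distribution before $\phi$ is applied. The class and function names are hypothetical.

```python
import random

P = 2_147_483_647  # prime assumed larger than any item identifier

class StarSketch:
    """t rows of k counters; row i counts items per cell of the partition induced by h_i."""

    def __init__(self, k, t, seed=0):
        rng = random.Random(seed)
        self.k, self.t = k, t
        # One (a, b) pair per row defines h_i(x) = ((a*x + b) mod P) mod k.
        self.coeffs = [(rng.randrange(1, P), rng.randrange(P)) for _ in range(t)]
        self.rows = [[0] * k for _ in range(t)]

    def update(self, item):
        """Process one stream item: increment one counter in each row."""
        for row, (a, b) in zip(self.rows, self.coeffs):
            row[((a * item + b) % P) % self.k] += 1

def normalize(row):
    """Turn a row of counters into a k-point distribution (assumes a nonempty stream)."""
    total = sum(row)
    return [c / total for c in row]

def sketch_distance(s1, s2, phi):
    """Maximum of phi over the t sampled partitions."""
    assert s1.coeffs == s2.coeffs, "both sketches must share the hash functions"
    return max(phi(normalize(r1), normalize(r2))
               for r1, r2 in zip(s1.rows, s2.rows))
```

A typical use is to create two `StarSketch` objects with identical `k`, `t` and `seed`, feed each with one stream through `update`, and call `sketch_distance` with the chosen generalized metric. Divergences that are undefined on zero entries, such as the KL divergence, would need some smoothing of empty cells, which is left out of this sketch.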



VII. Performance Evaluation 

We have implemented our Sketch -k-metric and have con- 
ducted a series of experiments on different types of streams 
and for different parameters settings. We have fed our algo- 
rithm with both real-world data and synthetic traces. Real data 
give a realistic representation of some real systems, while 
the latter ones allow to capture phenomenon which may be 
difficult to obtain from real-world traces, and thus allow to 
check the robustness of our metric. We have varied all the 
significant parameters of our algorithm, that is, the maximal 
number of distinct data items n in each stream, the number 
of cells fc of each generated partition, and the number of 
generated partitions t. For each parameters setting, we have 
conducted and averaged 100 trials of the same experiment, 
leading to a total of more than 300, 000 experiments for the 
evaluation of our metric. Real data have been downloaded 
from the repository of Internet network traffic |29|. We have 
used five traces among the available ones. Two of them 
represent two weeks logs of HTTP requests to the Internet 
service provider ClarkNet WWW server - ClarkNet is a full 
Internet access provider for the Metro Baltimore- Washington 
DC area - the other two ones contain two months of HTTP 
requests to the NASA Kennedy Space Center WWW server, 
and the last one represents seven months of HTTP requests to 
the WWW server of the University of Saskatchewan, Canada. 
In the following these data sets will be respectively referred to 
as ClarkNet, NASA, and Saskatchewan traces. Table [[[presents 
the statistics of these data traces, in term of stream size (cf. "# 
items" in the table), number of distinct items in each stream 
(cf. "# distinct items") and the number of occurrences of the 
most frequent item (cf. "max. freq."). For more information on 
these data traces, an extensive analysis is available in [30]. We 
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Dtitci tr3.cc 


ff items 


$ distinct items 


m3.x frecj 


NASA (July) 


1,891,715 


81,983 


17,572 


NASA (August) 


1,569,898 


75,058 


6,530 


ClarkNet (August) 


1,654,929 


90,516 


6,075 


ClarkNet (September) 


1,673,794 


94,787 


7,239 


Saskatchewan 


2,408,625 


162,523 


52,695 



Table I 

Statistics of real data traces. 
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NASA (July) — 
NASA (August) — 
ClarkNet (August) 

ClarkNet (September) 

University of Saskatchewan — 




1 

1 10 100 1000 10000 100000 le+06 

Figure 2. Logscale distribution of frequencies for each real data trace. 



have evaluated the accuracy of our metric by comparing, for 
each data set (real and synthetic), the results obtained with our 
algorithm on the stream sketches (referred to as Sketch in the 
legend) and the ones obtained on full streams (referred to as 
Ref distance in the legend). That is, for each pair of input 
streams, and for each generalized metric φ, we have computed 
both the exact distance between the two streams and the one 
generated by our metric. We now present the main lessons 
drawn from these experiments. 
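
The comparison protocol described above can be illustrated with a minimal sketch. It is not the implementation used for the experiments. Several points are assumptions made for the illustration: item identifiers are integers, cells are chosen with a hash of the form ((a·x + b) mod P) mod k, the estimate kept is the maximum over the t partitions, and all function names are hypothetical.

```python
import math
import random
from collections import Counter

P = 2_147_483_647  # prime 2^31 - 1, larger than any item identifier used here


def normalize(counts):
    """Turn a list of counters into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


def kullback_leibler(p, q):
    """D_KL(p || q) in bits; infinite if q misses mass that p has."""
    d = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            d += pi * math.log2(pi / qi)
    return d


def jensen_shannon(p, q):
    """Symmetrized variant of KL built on the midpoint distribution."""
    mid = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kullback_leibler(p, mid) + 0.5 * kullback_leibler(q, mid)


def bhattacharyya(p, q):
    """-ln of the Bhattacharyya coefficient sum_i sqrt(p_i * q_i)."""
    bc = sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))
    return -math.log(bc) if bc > 0 else math.inf


def ref_distance(stream_p, stream_q, metric, n):
    """Exact distance, computed on the full frequency vectors."""
    cp, cq = Counter(stream_p), Counter(stream_q)
    p = normalize([cp[i] for i in range(n)])
    q = normalize([cq[i] for i in range(n)])
    return metric(p, q)


def sketch_distance(stream_p, stream_q, metric, k, t, seed=0):
    """Estimate from t random k-cell partitions (ASSUMPTION: keep the maximum)."""
    rng = random.Random(seed)
    best = 0.0
    for _ in range(t):
        a, b = rng.randrange(1, P), rng.randrange(P)
        cells_p, cells_q = [0] * k, [0] * k
        for x in stream_p:
            cells_p[((a * x + b) % P) % k] += 1
        for x in stream_q:
            cells_q[((a * x + b) % P) % k] += 1
        best = max(best, metric(normalize(cells_p), normalize(cells_q)))
    return best


# Two toy streams over n items; every item occurs in both streams.
n = 400
stream_p = [i for i in range(n) for _ in range(i % 5 + 1)]
stream_q = [i for i in range(n) for _ in range((n - i) % 7 + 1)]

for metric in (kullback_leibler, jensen_shannon, bhattacharyya):
    ref = ref_distance(stream_p, stream_q, metric, n)
    est = sketch_distance(stream_p, stream_q, metric, k=40, t=4)
    print(f"{metric.__name__}: ref={ref:.4f} sketch={est:.4f}")
```

Because merging items into cells can only lose information, the value computed on the cells never exceeds the value computed on the full frequency vectors for these three measures. This is why the maximum over partitions is a natural choice in this illustration.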

Figures 3 and 4 show the accuracy of our metric as a function 
of the different input streams and the different generalized 
metrics applied on these streams. All the histograms shown 
in Figures 3(a) to 4(e) share the same legend, but for readability 
reasons, this legend is only indicated on histogram 3(c). Three 
generalized metrics have been used, namely the Bhattacharyya 
distance, the Kullback-Leibler and the Jensen-Shannon divergences, 
and five distribution families denoted by p and q have 
been compared with these metrics. 

Let us focus on synthetic traces. The first noticeable remark 
is that our metric behaves perfectly well when the two 
compared streams follow the same distribution, whatever the 
generalized metric φ used (cf. Figure 3(a) with the uniform 
distribution, Figures 3(b), 3(d) and 3(f) with the Zipf distribution, 
Figure 3(c) with the Pascal distribution, Figure 3(e) with 
the Binomial distribution, and Figure 3(g) with the Poisson 
one). This result is interesting as it makes the Sketch ★-metric 
a very good candidate as a parametric method for 
making inference about the parameters of the distribution 
followed by an input stream. The general tendency is that when the 
distributions of input streams are close (e.g., Zipf distributions 
with different parameters, or Pascal and Zipf with α = 4), then 
applying the generalized metrics φ on sketches gives a good 
estimation of the distance as computed on the full streams. 

Now, when the two input distributions exhibit totally 
different shapes, this may have an impact on the precision of 
our metric. Specifically, let us consider as input distributions 
the Uniform and the Pascal distributions (see Figures 3(a) 
and 3(c)). Sketching the Uniform distribution leads to k-cell 
partitions whose values are well distributed, that is, for a given 
partition all the k cells have with high probability the 
same value. Now, when sketching the Pascal distribution, the 
distribution of the data items in the cells of any given partition 
is such that a small number of data items (those with high 
frequency) populate a very small number of cells. However, 
the values of these cells are very large compared to the other 
cells, which are populated by a large number of data items 
whose frequency is small. Thus the contribution of data items 
exhibiting a small frequency and sharing the cells of highly 
frequent items will be biased compared to the contribution of 
the other items. This explains why the accuracy of the Sketch 
★-metric is slightly lowered in these cases. 
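
The effect described above can be reproduced on a toy example. In this sketch, the frequency vectors, the seeded random cell assignment (standing in for a hash function) and the helper names are all illustrative assumptions. The peaked vector is not a Pascal distribution, only a stand-in with a few heavy items.

```python
import random


def cell_values(freqs, k, seed=0):
    """Sum item frequencies per cell; a seeded random assignment stands in for the hash."""
    rng = random.Random(seed)
    cells = [0] * k
    for f in freqs:
        cells[rng.randrange(k)] += f
    return cells


def peak_to_mean(cells):
    """Largest cell value divided by the average cell value."""
    return max(cells) / (sum(cells) / len(cells))


n, k = 4000, 200
flat = [50] * n                          # every item has the same frequency
peaked = [20000] * 5 + [25] * (n - 5)    # 5 heavy items, 3995 light ones (illustrative)

print("flat  :", round(peak_to_mean(cell_values(flat, k)), 2))
print("peaked:", round(peak_to_mean(cell_values(peaked, k)), 2))
```

With the flat vector, all cells stay close to the average value. With the peaked vector, the few cells that receive a heavy item dominate, and the light items sharing those cells contribute very little to the cell value.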

We can also observe the strong impact of the non-symmetry 
of the Kullback-Leibler divergence on the computation of the 
distance (computed on full streams or on sketches), with a clear 
influence when the input streams follow a Pascal distribution and 
a Zipf distribution with α = 1 (see Figures 3(c) and 3(b)). 
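
The non-symmetry mentioned above is easy to check numerically. The following sketch uses two toy distributions chosen for illustration only, not taken from the experiments. It compares the Kullback-Leibler divergence in both directions with the Jensen-Shannon divergence, which is symmetric by construction.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits (assumes q_i > 0 wherever p_i > 0)."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: average KL to the midpoint distribution."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


p = [0.5, 0.5]   # flat toy distribution
q = [0.9, 0.1]   # peaked toy distribution

print("KL(p||q) =", kl(p, q))
print("KL(q||p) =", kl(q, p))
print("JS(p,q)  =", js(p, q), "JS(q,p) =", js(q, p))
```

Here KL(p||q) is about 0.737 bits whereas KL(q||p) is about 0.531 bits, so swapping the roles of the two streams changes the Kullback-Leibler value while leaving the Jensen-Shannon value unchanged.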

Finally, Figure 3(h) summarizes the good properties of our 
method whatever the input streams to be compared and the 
generalized metric φ used to do this comparison. 

The same general remarks hold when considering real data 
sets. Indeed, Figure 4 shows that when the input streams 
are close to each other, which is the case for both (July 
and August) NASA and (August and September) ClarkNet 
traces (cf. Figure 2), then applying the generalized metrics 
φ on sketches gives good results w.r.t. full streams. When 
the shapes of the input streams are different (which is the 
case for Saskatchewan w.r.t. the four other input streams), the 
accuracy of the Sketch ★-metric decreases only slightly. 
Notice that the scales on the y-axis differ 
significantly in Figure 3 and in Figure 4. 

Figure 5 presents the impact of the number of cells per 
generated partition on the accuracy of our metric on both 
synthetic traces and real data. It clearly shows that, by increasing 
k, the number of data items per cell in the generated partition 
shrinks and thus the absolute error on the computation of the 
distance decreases. The same mechanism is at work when the number 
n of distinct data items in the stream increases. Indeed, when 
n increases (for a given k), the number of data items per cell 
grows and thus the precision of our metric decreases. This 
gives rise to a shift of the inflection point, as illustrated in 
Figure 5(b), because the real data sets have almost twenty 
times more distinct data items than the synthetic ones. As 
mentioned above, the input streams exhibit very different shapes, 
which explains the strong impact of k. Note also that k has the 
same influence on the Sketch ★-metric for all the generalized 
distances φ. 
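
As a rough back-of-the-envelope check of this reasoning, the average number of distinct items sharing a cell is n / k. The snippet below applies this to the parameter settings reported for Figures 3 and 4 and to the distinct-item counts of Table I. The helper name is hypothetical and the computation ignores how unevenly a hash function may fill the cells.

```python
def items_per_cell(n, k):
    """Average number of distinct items that share one cell of a k-cell partition."""
    return n / k


# n values come from Table I and the figure parameter settings; k from Figures 3 and 4.
settings = {
    "synthetic (n = 4,000, k = 200)": (4_000, 200),
    "ClarkNet August (k = 2,000)": (90_516, 2_000),
    "Saskatchewan (k = 2,000)": (162_523, 2_000),
}
for name, (n, k) in settings.items():
    print(f"{name}: {items_per_cell(n, k):.1f} items per cell")

# Number of cells a real trace would need to match the synthetic load of 20 items per cell.
print(round(90_516 / items_per_cell(4_000, 200)))
```

Even with ten times more cells, the real traces still put more than twice as many distinct items in each cell as the synthetic setting does, which is consistent with an inflection point reached for larger values of k.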

(b) p = Zipf distribution with α = 1 
(d) p = Zipf distribution with α = 2 
(e) p = Binomial distribution with p = 0.5 
(f) p = Zipf distribution with α = 4 
(g) p = Poisson distribution 
(h) p = Uniform distribution and q = Pascal distribution, as a function of its parameter r (p = n/(2r + n)) 

Figure 3. Sketch ★-metric accuracy as a function of p and q (or r for 3(h)). Parameter setting is as follows: m = 200,000; n = 4,000; k = 200; t = 4, 
where m represents the size of the stream, n the number of distinct data items in the stream, t the number of generated partitions and k the number of cells 
per generated partition. 

(a) p = NASA webserver (August) 
(b) p = NASA webserver (July) 
(c) p = ClarkNet webserver (August) 
(d) p = ClarkNet webserver (September) 
(e) p = Saskatchewan University webserver 

Figure 4. Sketch ★-metric accuracy as a function of real data traces. Parameter setting is as follows: k = 2,000; t = 4. 

(a) Sketch ★-metric accuracy as a function of k 
(b) Sketch ★-metric accuracy between data traces extracted from ClarkNetwork (August) and Saskatchewan University, as a function of k 

Figure 5. Sketch ★-metric between the Uniform distribution and Pascal with 
parameter p = n/(2r + n) (Figures 5(a) and 6(a)), and between data traces extracted 
from ClarkNetwork (August) and Saskatchewan University (Figures 5(b) 
and 6(b)). 



Figure 6 shows the slight influence of the number t of 
generated partitions on the accuracy of our metric. The reason 
comes from the use of 2-universal hash functions, which 
guarantee for each of them and with high probability that 
data items are uniformly distributed over the cells of any 
partition. As a consequence, increasing the number of such 
hash functions has a weak influence on the accuracy of the 
metric. Figure 7 focuses on the error made on the Sketch 
★-metric for five different values of t as a function of parameter 
r of the Pascal distribution (recall that increasing values of 
r, while maintaining the mean value, make the shape 
of the Pascal distribution flatter). Figures 7(b), 7(d) and 7(f) 
respectively depict, for each value of t, the difference between 
the reference and the sketch values, which makes the impact 
of t more visible. The main lesson drawn from these 
figures is again the moderate impact of t on the precision of our 
algorithm. 
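
To make the role of 2-universal hashing concrete, here is a minimal sketch of one classical 2-universal family, h(x) = ((a·x + b) mod P) mod k with P prime. Whether the experiments used this exact family is not stated here, so the family, the modulus P and the helper names are assumptions. The snippet estimates how often two fixed distinct items fall into the same cell when the hash function is drawn at random.

```python
import random

P = 2_147_483_647  # prime modulus 2^31 - 1 (must exceed every item identifier)


def random_hash(k, rng):
    """Draw h(x) = ((a*x + b) mod P) mod k from the Carter-Wegman family."""
    a, b = rng.randrange(1, P), rng.randrange(P)
    return lambda x: ((a * x + b) % P) % k


def collision_rate(x, y, k, trials, seed=0):
    """Fraction of randomly drawn hash functions sending x and y to the same cell."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = random_hash(k, rng)
        hits += h(x) == h(y)
    return hits / trials


k = 200
rate = collision_rate(12, 345, k, trials=20_000)
print(f"empirical collision rate {rate:.4f} vs 1/k = {1 / k:.4f}")
```

The empirical rate should stay close to 1/k, meaning that each single hash function already spreads items almost evenly over the k cells. This matches the observation that adding more partitions brings only a limited gain.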

VIII. Conclusion and open issues 

In this paper, we have introduced a new metric, the Sketch 
★-metric, that makes it possible to compute any generalized metric φ on 
the summaries of two large input streams. We have presented a 
simple and efficient algorithm to sketch streams and compute 
this metric, and we have shown that it behaves well 
whatever the considered input streams. We are convinced of 
the indisputable interest of such a metric in various domains 
including machine learning, data mining, databases, information 
retrieval and network monitoring. 

Regarding future work, we plan to characterize our metric 
among Rényi divergences [31], also known as α-divergences, 
which generalize different divergence classes. We also plan 
to consider a distributed setting, where each site would be in 
charge of analyzing its own streams and then would propagate 
its results to the other sites of the system for comparison or 
merging. An immediate application of such a "tool" would 
be to detect massive attacks in a decentralized manner (e.g., 
by identifying specific connection profiles as with worm 
propagation and massive port scan attacks, or by detecting 
sudden variations in the volume of received data). 

(f) Difference with Jensen-Shannon divergence 

Figure 7. Sketch ★-metric estimation between Uniform distribution 
and Pascal with parameter p = n/(2r + n), as a function of k, t and r. 



